Nextflow

Scalable, Sharable and Reproducible Computational Workflows across Clouds and Clusters

Rad Suchecki (CSIRO)

The challenge

  • Large analysis workflows are fragile ecosystems of software tools, scripts and dependencies.

  • This complexity commonly makes these workflows not only irreproducible but sometimes not even re-runnable outside their original development environment.

  • Even small workflows are affected

Let others and your future self(!)

  • reliably re-run your analyses
  • trace back origins of results

Re-running pipelines

  • new data (e.g. additional samples)
  • updated software
  • different compute environment (cloud?)
  • errors found
  • new ideas
  • and any combination of the above

Push-button workflow wish-list


  • version controlled
  • container-backed
  • seamless execution across different environments (if computationally feasible)
    • laptop/server/cluster/cloud
  • sharable
    • minimal effort required for someone else to use it

Nextflow


  • Reactive workflow framework
  • Domain specific programming language
  • Aimed at bioinformaticians familiar with programming


  • Designed for seamless scalability of existing tools and scripts
    • Implicitly parallelized, asynchronous data streams
  • Separation of pipeline logic from the definitions of
    • software environment (on $PATH, modules, binaries, conda, containers)
    • execution environment (laptop, server, cluster, cloud)

https://www.nature.com/articles/nbt.3820/

Credit: Evan Floden

Nextflow building blocks

Processes (1-to-many tasks)

  • safe and lock-free parallelization
  • executed in separate work directories
  • easy clean-up, no issue of partial results following an error

Channels

  • facilitate data flow between processes by linking their outputs/inputs
  • a suite of operators applied to channels to shape the data flow
    • filtering, transforming, forking, combining…

https://www.nature.com/articles/nbt.3820/

Parallelisation

Credit: Evan Floden

  • Independent tasks will run in parallel (Ts & Cs apply)
  • Reduced overallocation of resources

Getting started

Required

  • POSIX compatible system (Linux, Solaris, OS X, etc)
  • Bash 3.2 (or later)
  • Java 8 (or later)

Install

curl -s https://get.nextflow.io | bash

Software you want to run

  • Available on PATH or under bin/
  • via Docker
  • via Singularity
  • via Conda
  • via Modules

Hello world syntax

#!/usr/bin/env nextflow
echo true

cheers = Channel.from 'Bonjour', 'Ciao', 'Hello', 'Hola'

process sayHello {
  input: 
    val x from cheers
  script:
    """
    echo '$x world!'
    """
}

Hello world run

N E X T F L O W  ~  version 19.01.0
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
Launching `nextflow-io/hello` [nasty_wozniak] - revision: a9012339ce [master]
[warm up] executor > local
[f8/52866d] Submitted process > sayHello (1)
[89/8e3d0a] Submitted process > sayHello (2)
[5a/12ca76] Submitted process > sayHello (3)
Bonjour world!
Ciao world!
[1b/e487d8] Submitted process > sayHello (4)
Hello world!
Hola world!

Hello world how?

 project name: nextflow-io/hello
 repository  : https://github.com/nextflow-io/hello
 local path  : /home/rad/.nextflow/assets/nextflow-io/hello
 main script : main.nf
 revisions   : 
 * master (default)
   mybranch
   testing
   v1.1 [t]
   v1.2 [t]

Alternatives

Hello world shared pipelines

  • Run specific revision (commit SHA hash, branch or tag)
N E X T F L O W  ~  version 19.01.0
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
Launching `nextflow-io/hello` [trusting_euler] - revision: baba3959d7 [v1.1]
[warm up] executor > local
[34/9cacca] Submitted process > sayHello (2)
[71/31d426] Submitted process > sayHello (1)
[e0/c7cdb4] Submitted process > sayHello (3)
Ciao world! (version 1.1)
Bojour world! (version 1.1)
[c2/40df8c] Submitted process > sayHello (4)
Hello world! (version 1.1)
Hola world! (version 1.1)
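
The launch logs and the project summary above are consistent with commands along the following lines. This is a sketch, not a transcript of the original session: the wrapper function names are made up for illustration, and only the `run`, `info` and `-r` parts belong to the Nextflow command line.

```shell
# Hypothetical wrappers; each one spells out a single nextflow command.

# Pull (if needed) and run the default revision of a pipeline hosted on GitHub.
hello_run() {
  nextflow run nextflow-io/hello
}

# Show where the pipeline was downloaded to and which revisions are available.
hello_info() {
  nextflow info nextflow-io/hello
}

# Run a specific revision: a tag, a branch name or a commit SHA.
# $1 = revision, defaulting to the v1.1 tag seen in the log above.
hello_run_revision() {
  nextflow run nextflow-io/hello -r "${1:-v1.1}"
}
```

Calling `hello_run_revision mybranch` would, under the same assumptions, launch the branch listed among the revisions instead of the tag.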

Command line syntax basics

  • Single dash (-) for Nextflow params

-resume prevents tasks from being re-run when their relevant inputs and scripts are unchanged

  • Double dash (--) for pipeline params (defined by you)

Filename sample.fastq.gz will be available to NF under params.input, similarly the value bar will be accessible under params.foo.
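
As a sketch of how the two kinds of options combine on one command line, assuming the pipeline script is called `main.nf` (an assumption, as is the wrapper function name):

```shell
# Hypothetical wrapper around a single nextflow invocation.
# -resume (single dash) is consumed by Nextflow itself;
# --input and --foo (double dash) end up as params.input and params.foo.
run_with_params() {
  nextflow run main.nf -resume --input sample.fastq.gz --foo bar
}
```

Because pipeline params are defined by the pipeline author, a misspelt double-dash option is not necessarily reported as an error, so it is worth checking names against the params block of the pipeline being run.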

Flowchart

Logic (and input data) of this example workflow is adapted from the EMBL-ABR Snakemake webinar by Nathan Watson-Haigh

Example workflow

#!/usr/bin/env nextflow

//Build link to reference
referenceLink = params.ref.base_url + params.ref.chr + ".fsa.zip"

//Take accessions defined in nextflow.config.
//Use --take N to process first N accessions or --take all to process all
accessionsChannel = Channel.from(params.accessions).take( params.take == 'all' ? -1 : params.take )

//fetch adapters file - either local or remote
adaptersChannel = Channel.fromPath(params.adapters)

process download_chromosome {
  tag { params.ref.chr }

  //Prevent re-downloading of large files
  storeDir { "${params.outdir}/downloaded" }  //use with care, caching will not work as normal so changes to input may not take effect
  scratch false //must be false otherwise storeDir ignored

  input:
    val referenceLink

  output:
    file('*') into references

  script:
  """
  wget ${referenceLink}
  """
}

process bgzip_chromosome {
  cpus 2 //consider defining in conf/requirements.config based on process name or label
  tag { ref }

  input:
    file ref from references

  output:
    file('*') into chromosomesChannel

  script:
  """
  unzip -p ${ref} \
    | bgzip --threads ${task.cpus} \
    > ${ref}.gz
  """
}

process bgzip_chromosome_subregion {
  input:
    file chr from chromosomesChannel

  output:
    file('subregion') into subregionsChannel

  script:
  """
  samtools faidx ${chr} ${params.ref.chr}:${params.ref.start}-${params.ref.end} \
    | bgzip --threads ${task.cpus} \
    > subregion
  """
}

process extract_reads {
  tag { accession }
  storeDir { "${params.outdir}/downloaded_reads" }  //use with care, caching will not work as normal so changes to input may not take effect

  input:
    val accession from accessionsChannel
    //e.g. ACBarrie

  output:
    set val(accession), file('*.fastq.gz') into (extractedReadsChannelA, extractedReadsChannelB)
    //e.g. ACBarrie, [ACBarrie_R1.fastq.gz, ACBarrie_R2.fastq.gz]

  script:
  """
  samtools view -hu "${params.bam.base_url}/${params.bam.chr}/${accession}.realigned.bam" \
    ${params.bam.chr}:${params.bam.start}-${params.bam.end} \
  | samtools collate -uO - \
  | samtools fastq -F 0x900 -1 ${accession}_R1.fastq.gz -2 ${accession}_R2.fastq.gz \
    -s /dev/null -0 /dev/null - \
  && zcat ${accession}_R1.fastq.gz | head | awk 'END{exit(NR<4)}' \
  && zcat ${accession}_R2.fastq.gz | head | awk 'END{exit(NR<4)}'
  """
}

process fastqc_raw {
  tag { accession }

  input:
    set val(accession), file('*') from extractedReadsChannelA

  output:
    file('*') into fastqcRawResultsChannel

  script:
  """
  fastqc  --quiet --threads ${task.cpus} *
  """
}

process multiqc_raw {
  input:
    file('*') from fastqcRawResultsChannel.collect()

  output:
    file('*') into multiqcRawResultsChannel

  script:
  """
  multiqc .
  """
}

process trimmomatic_pe {
  echo true
  tag { accession }

  input:
    set file(adapters), val(accession), file('*') from adaptersChannel.combine(extractedReadsChannelB)

  output:
    set val(accession), file('*.paired.fastq.gz') into (trimmedReadsChannelA, trimmedReadsChannelB)

  script:
  """
  trimmomatic PE \
    *.fastq.gz \
    ${accession}_R1.paired.fastq.gz \
    ${accession}_R1.unpaired.fastq.gz \
    ${accession}_R2.paired.fastq.gz \
    ${accession}_R2.unpaired.fastq.gz \
    ILLUMINACLIP:${adapters}:2:30:10:3:true \
    LEADING:2 \
    TRAILING:2 \
    SLIDINGWINDOW:4:15 \
    MINLEN:36
  """
}

process fastqc_trimmed {
  tag { accession }

  input:
    set val(accession), file('*') from trimmedReadsChannelB

  output:
    file('*') into fastqcTrimmedResultsChannel

  script:
  """
  fastqc --quiet --threads ${task.cpus} *
  """
}

process multiqc_trimmed {
  input:
    file('*') from fastqcTrimmedResultsChannel.collect()

  output:
    file('*') into multiqcTrimmedResultsChannel

  script:
  """
  multiqc .
  """
}

process bwa_index {
  input:
    file(ref) from subregionsChannel

  output:
    set val(ref.name), file("*") into indexChannel //also valid: set val("${ref}"), file("*") into indexChannel

  script:
  """
  bwa index -a bwtsw ${ref}
  """
}


process bwa_mem {
  tag { accession }

  input:
    set val(ref), file('*'), val(accession), file(reads) from indexChannel.combine(trimmedReadsChannelA)

  output:
    file('*.bam') into alignedReadsChannel

  script:
  """
  bwa mem -t ${task.cpus} -R '@RG\\tID:${accession}\\tSM:${accession}' ${ref} ${reads} | samtools view -b > ${accession}.bam
  """
}
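
The two trailing checks in the extract_reads script guard against storing empty or truncated read files: `awk 'END{exit(NR<4)}'` exits with a non-zero status when it has seen fewer than four lines, that is, less than one complete FASTQ record. A stand-alone sketch of the same idiom follows; the function name is hypothetical and `gzip -dc` stands in for `zcat`.

```shell
# Succeeds (exit status 0) only if the gzipped file holds at least 4 lines,
# i.e. at least one complete FASTQ record; fails otherwise.
has_fastq_record() {
  # head limits the check to the first 10 lines; awk counts what it receives
  # and turns "fewer than 4" into a non-zero exit status.
  gzip -dc "$1" | head | awk 'END{exit(NR<4)}'
}
```

Inside a process script a failing check makes the whole task fail, so nothing is written to storeDir for that accession. Outside Nextflow the function can be used in a plain `if has_fastq_record reads.fastq.gz; then ...; fi`, where `reads.fastq.gz` is a placeholder file name.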

Example workflow

N E X T F L O W  ~  version 19.04.1
Launching `../main.nf` [mad_turing] - revision: f79dd55637
[warm up] executor > local
[skipping] Stored process > download_chromosome (chr4A)
[skipping] Stored process > extract_reads (ACBarrie)
[e9/8f8adb] Submitted process > bgzip_chromosome (iwgsc_refseqv1.0_chr4A.fsa.zip)
[33/4f4a24] Submitted process > fastqc_raw (ACBarrie)
[41/a6c033] Submitted process > trimmomatic_pe (ACBarrie)
[e1/31ebd6] Submitted process > bgzip_chromosome_subregion
[15/75bfce] Submitted process > fastqc_trimmed (ACBarrie)
[f8/aa87cc] Submitted process > multiqc_raw
[e7/708f93] Submitted process > bwa_index
[78/0c6a4f] Submitted process > bwa_mem (ACBarrie)
[07/6ea778] Submitted process > multiqc_trimmed

The work directory (1/2)

work
├── 07
│   └── 6ea7782bdd4b364ff26c34d1bcbce6
├── 15
│   └── 75bfce39613a171bc9d1bf36020ca0
├── 33
│   └── 4f4a2428ed6d0519ebf4c07af454dc
├── 41
│   └── a6c03378bb0a5b008905e0dd599156
├── 78
│   └── 0c6a4fbc0d1a2ac28baa1e8a1435c0
├── e1
│   └── 31ebd67396d6731681750b5be54245
├── e7
│   └── 708f935b311acc6f8133a081527431
├── e9
│   └── 8f8adba927938f235626b924afc564
├── f8
│   └── aa87cc036dc404ec680680451e617b
└── stage
    └── 6a

20 directories, 0 files

The work directory (2/2)

work
├── [4.0K]  07
│   └── [4.0K]  6ea7782bdd4b364ff26c34d1bcbce6
│       ├── [ 116]  ACBarrie_R1.paired_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/15/75bfce39613a171bc9d1bf36020ca0/ACBarrie_R1.paired_fastqc.html
│       ├── [ 115]  ACBarrie_R1.paired_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/15/75bfce39613a171bc9d1bf36020ca0/ACBarrie_R1.paired_fastqc.zip
│       ├── [ 116]  ACBarrie_R2.paired_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/15/75bfce39613a171bc9d1bf36020ca0/ACBarrie_R2.paired_fastqc.html
│       ├── [ 115]  ACBarrie_R2.paired_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/15/75bfce39613a171bc9d1bf36020ca0/ACBarrie_R2.paired_fastqc.zip
│       ├── [   0]  .command.begin
│       ├── [ 397]  .command.err
│       ├── [ 397]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.9K]  .command.run
│       ├── [  26]  .command.sh
│       ├── [ 221]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [4.0K]  multiqc_data
│       │   ├── [119K]  multiqc_data.json
│       │   ├── [ 835]  multiqc_fastqc.txt
│       │   ├── [ 413]  multiqc_general_stats.txt
│       │   ├── [ 12K]  multiqc.log
│       │   └── [ 344]  multiqc_sources.txt
│       └── [1.1M]  multiqc_report.html
├── [4.0K]  15
│   └── [4.0K]  75bfce39613a171bc9d1bf36020ca0
│       ├── [699K]  ACBarrie_R1.paired_fastqc.html
│       ├── [465K]  ACBarrie_R1.paired_fastqc.zip
│       ├── [ 113]  ACBarrie_R1.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/41/a6c03378bb0a5b008905e0dd599156/ACBarrie_R1.paired.fastq.gz
│       ├── [707K]  ACBarrie_R2.paired_fastqc.html
│       ├── [468K]  ACBarrie_R2.paired_fastqc.zip
│       ├── [ 113]  ACBarrie_R2.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/41/a6c03378bb0a5b008905e0dd599156/ACBarrie_R2.paired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.5K]  .command.run
│       ├── [  45]  .command.sh
│       ├── [ 225]  .command.trace
│       └── [   1]  .exitcode
├── [4.0K]  33
│   └── [4.0K]  4f4a2428ed6d0519ebf4c07af454dc
│       ├── [698K]  ACBarrie_R1_fastqc.html
│       ├── [460K]  ACBarrie_R1_fastqc.zip
│       ├── [  92]  ACBarrie_R1.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R1.fastq.gz
│       ├── [703K]  ACBarrie_R2_fastqc.html
│       ├── [472K]  ACBarrie_R2_fastqc.zip
│       ├── [  92]  ACBarrie_R2.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R2.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.4K]  .command.run
│       ├── [  46]  .command.sh
│       ├── [ 226]  .command.trace
│       └── [   1]  .exitcode
├── [4.0K]  41
│   └── [4.0K]  a6c03378bb0a5b008905e0dd599156
│       ├── [  92]  ACBarrie_R1.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R1.fastq.gz
│       ├── [154K]  ACBarrie_R1.paired.fastq.gz
│       ├── [1.0K]  ACBarrie_R1.unpaired.fastq.gz
│       ├── [  92]  ACBarrie_R2.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R2.fastq.gz
│       ├── [156K]  ACBarrie_R2.paired.fastq.gz
│       ├── [ 580]  ACBarrie_R2.unpaired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [ 882]  .command.err
│       ├── [ 882]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.6K]  .command.run
│       ├── [ 290]  .command.sh
│       ├── [ 206]  .command.trace
│       ├── [   1]  .exitcode
│       └── [ 105]  TruSeq3-PE.fa -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/stage/6a/bf69d267108422b641a00506fb16fe/TruSeq3-PE.fa
├── [4.0K]  78
│   └── [4.0K]  0c6a4fbc0d1a2ac28baa1e8a1435c0
│       ├── [432K]  ACBarrie.bam
│       ├── [ 113]  ACBarrie_R1.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/41/a6c03378bb0a5b008905e0dd599156/ACBarrie_R1.paired.fastq.gz
│       ├── [ 113]  ACBarrie_R2.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/41/a6c03378bb0a5b008905e0dd599156/ACBarrie_R2.paired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [1.2K]  .command.err
│       ├── [1.2K]  .command.log
│       ├── [   0]  .command.out
│       ├── [ 10K]  .command.run
│       ├── [ 164]  .command.sh
│       ├── [ 210]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [  99]  subregion.amb -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e7/708f935b311acc6f8133a081527431/subregion.amb
│       ├── [  99]  subregion.ann -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e7/708f935b311acc6f8133a081527431/subregion.ann
│       ├── [  99]  subregion.bwt -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e7/708f935b311acc6f8133a081527431/subregion.bwt
│       ├── [  99]  subregion.pac -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e7/708f935b311acc6f8133a081527431/subregion.pac
│       └── [  98]  subregion.sa -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e7/708f935b311acc6f8133a081527431/subregion.sa
├── [4.0K]  e1
│   └── [4.0K]  31ebd67396d6731681750b5be54245
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.3K]  .command.run
│       ├── [ 131]  .command.sh
│       ├── [ 216]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [ 119]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e9/8f8adba927938f235626b924afc564/iwgsc_refseqv1.0_chr4A.fsa.zip.gz
│       ├── [  24]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz.fai
│       ├── [181K]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz.gzi
│       └── [ 17K]  subregion
├── [4.0K]  e7
│   └── [4.0K]  708f935b311acc6f8133a081527431
│       ├── [   0]  .command.begin
│       ├── [ 480]  .command.err
│       ├── [ 480]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.2K]  .command.run
│       ├── [  45]  .command.sh
│       ├── [ 201]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [  95]  subregion -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/e1/31ebd67396d6731681750b5be54245/subregion
│       ├── [  33]  subregion.amb
│       ├── [  56]  subregion.ann
│       ├── [ 57K]  subregion.bwt
│       ├── [ 14K]  subregion.pac
│       └── [ 28K]  subregion.sa
├── [4.0K]  e9
│   └── [4.0K]  8f8adba927938f235626b924afc564
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.3K]  .command.run
│       ├── [ 120]  .command.sh
│       ├── [ 229]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [  96]  iwgsc_refseqv1.0_chr4A.fsa.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded/iwgsc_refseqv1.0_chr4A.fsa.zip
│       └── [216M]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz
├── [4.0K]  f8
│   └── [4.0K]  aa87cc036dc404ec680680451e617b
│       ├── [ 109]  ACBarrie_R1_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/33/4f4a2428ed6d0519ebf4c07af454dc/ACBarrie_R1_fastqc.html
│       ├── [ 108]  ACBarrie_R1_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/33/4f4a2428ed6d0519ebf4c07af454dc/ACBarrie_R1_fastqc.zip
│       ├── [ 109]  ACBarrie_R2_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/33/4f4a2428ed6d0519ebf4c07af454dc/ACBarrie_R2_fastqc.html
│       ├── [ 108]  ACBarrie_R2_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/33/4f4a2428ed6d0519ebf4c07af454dc/ACBarrie_R2_fastqc.zip
│       ├── [   0]  .command.begin
│       ├── [ 397]  .command.err
│       ├── [ 397]  .command.log
│       ├── [   0]  .command.out
│       ├── [9.8K]  .command.run
│       ├── [  26]  .command.sh
│       ├── [ 223]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [4.0K]  multiqc_data
│       │   ├── [126K]  multiqc_data.json
│       │   ├── [ 807]  multiqc_fastqc.txt
│       │   ├── [ 399]  multiqc_general_stats.txt
│       │   ├── [ 12K]  multiqc.log
│       │   └── [ 316]  multiqc_sources.txt
│       └── [1.1M]  multiqc_report.html
└── [4.0K]  stage
    └── [4.0K]  6a
        └── [4.0K]  bf69d267108422b641a00506fb16fe
            └── [  93]  TruSeq3-PE.fa

23 directories, 132 files
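
The listing shows that each task directory keeps the script that was run (.command.sh), the wrapper that launched it (.command.run), the captured output streams and the recorded exit status (.exitcode), with inputs staged as symlinks. The short hash printed in the run log, e.g. [e9/8f8adb], is a prefix of the directory path. A minimal sketch for looking up a task by that prefix, with hypothetical helper names and `work` assumed as the work directory:

```shell
# Hypothetical helper: expand a short task hash such as "e9/8f8adb"
# (as printed in the run log) to the matching directory under work/.
# $1 = short hash, $2 = work directory (defaults to ./work)
task_dir() {
  for d in "${2:-work}"/"$1"*/; do
    [ -d "$d" ] && { printf '%s\n' "${d%/}"; return 0; }
  done
  return 1
}

# Print the script a task ran and the exit status it recorded.
task_summary() {
  dir=$(task_dir "$1" "${2:-work}") || {
    echo "no task directory matches $1" >&2
    return 1
  }
  echo "task directory: $dir"
  echo "script (.command.sh):"
  cat "$dir/.command.sh"
  echo "exit status: $(cat "$dir/.exitcode")"
}
```

With the run shown above, `task_summary e9/8f8adb` would be expected to report on the bgzip_chromosome task.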

Example workflow run

Recording of the run (asciicast): https://asciinema.org/a/233197

Configuration file(s)

  • nextflow.config
  • $HOME/.nextflow/config
  • But also
    • Include further files from within a config file: includeConfig 'conf/publish.config'
    • Extend config by passing an additional file at run time: -c additional.config
    • Ignore the default config files and use only a custom file passed at run time: -C custom.config
  • Config scopes e.g. env, params, process, docker
  • Config profiles (!)
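
A sketch of how these options might appear on the command line, assuming a pipeline script named `main.nf` and profile names taken from the example nextflow.config shown further on. The wrapper function names are hypothetical. Note that -C belongs to the nextflow launcher and so precedes `run`, whereas -c and -profile follow it.

```shell
# Hypothetical wrappers, one nextflow invocation each.

# Add settings from an extra file on top of the default config files.
run_extra_config() {
  nextflow run main.nf -c additional.config
}

# Ignore the default config files and use only the given one.
# -C is an option of the nextflow launcher itself, so it precedes "run".
run_only_custom_config() {
  nextflow -C custom.config run main.nf
}

# Select one or more config profiles (comma-separated, no spaces).
# $1 = profile list, defaulting to the docker profile.
run_profiles() {
  nextflow run main.nf -profile "${1:-docker}"
}
```

Under these assumptions, `run_profiles slurm,singularity,singularitymodule` would combine an executor profile with software profiles on a cluster.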

Recall: separation of workflow logic from compute, software envs

  • Much about the software and execution environments could be defined in the directive declarations block at the top of each process body, but it can instead be kept apart from the workflow logic in nextflow.config, e.g.
params {
  take = 1 //can be overwritten at run-time e.g. --take 2 to just process first two accessions or --take all to process all
  accessions = [
    "ACBarrie",
    "Alsen",
    "Baxter",
    "Chara",
    "Drysdale",
    "Excalibur",
    "Gladius",
    "H45",
    "Kukri",
    "Pastor",
    "RAC875",
    "Volcanii",
    "Westonia",
    "Wyalkatchem",
    "Xiaoyan",
    "Yitpi"
  ]

  adapters = "https://raw.githubusercontent.com/timflutre/trimmomatic/master/adapters/TruSeq3-PE.fa"

  ref {
    base_url = "https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/iwgsc_refseqv1.0_"
    chr      = "chr4A"
    start    = "688055092"
    end      = "688113092"
  }
  bam {
    base_url = "http://crobiad.agwine.adelaide.edu.au/dawn/jbrowse-prod/data/wheat_parts/minimap2_defaults/whole_genome/PE/BPA"
    chr      = "chr4A_part2"
    start    = "235500000"
    end      = "235558000"
  }
  outdir = "./results" //can be overwritten at run-time e.g. --outdir dirname
  infodir = "./flowinfo" //can be overwritten at run-time e.g. --infodir dirname
}

process {
  cache = 'lenient'
}

profiles {
  //SOFTWARE
  conda {
    process {
      conda = "$baseDir/conf/conda.yaml"
    }
  }
  condamodule {
    process.module = 'miniconda3/4.3.24'
  }
  docker {
    process.container = 'rsuchecki/nextflow-embl-abr-webinar'
    docker {
      enabled = true
      fixOwnership = true
    }
  }
  singularity {
    process {
      container = 'shub://csiro-crop-informatics/nextflow-embl-abr-webinar' //Singularity hub
      // container = 'rsuchecki/nextflow-embl-abr-webinar' //pulled from Docker hub - would suffice but Singularity container is re-built from docker image so not ideal for reproducibility
      //scratch = true //This is a hack needed for singularity versions approx after 2.5 and before 3.1.1 as a workaround for https://github.com/sylabs/singularity/issues/1469#issuecomment-469129088
    }
    singularity {
      enabled = true
      autoMounts = true
      cacheDir = "singularity-images"  //when distributing the pipeline probably should point under $workDir
    }
  }
  singularitymodule {
    process.module = 'singularity/3.2.1' //Specific to our cluster - update as required
  }
  //EXECUTORS
  awsbatch {
    aws.region = 'ap-southeast-2'
    process {
      executor = 'awsbatch'
      queue = 'flowq'
      container = 'rsuchecki/nextflow-embl-abr-webinar'
    }
    executor {
      awscli = '/home/ec2-user/miniconda/bin/aws'
    }
  }
  slurm {
    process {
      executor = 'slurm'
    }
  }
}

//PUBLISH RESULTS
params.publishmode = "copy"
includeConfig 'conf/publish.config'

//GENERATE REPORT https://www.nextflow.io/docs/latest/tracing.html#execution-report
report {
    enabled = true
    file = "${params.infodir}/report.html"
}

//GENERATE TIMELINE https://www.nextflow.io/docs/latest/tracing.html#timeline-report
timeline {
    enabled = true
    file = "${params.infodir}/timeline.html"
}

//GENERATE PIPELINE TRACE https://www.nextflow.io/docs/latest/tracing.html#trace-report
trace {
    enabled = true
    file = "${params.infodir}/trace.txt"
}

//GENERATE GRAPH REPRESENTATION OF THE PIPELINE FLOW
dag {
    enabled = true
    file = "${params.infodir}/flowchart.dot"
    // file = "${params.infodir}/flowchart.png" //requires graphviz for rendering
}
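
The dag scope above writes the flowchart in Graphviz dot format; as the comment notes, producing an image requires Graphviz. A sketch of rendering the file after a run, assuming Graphviz's `dot` is installed and the default infodir of ./flowinfo (the function name is hypothetical):

```shell
# Hypothetical helper: render flowchart.dot to a PNG with Graphviz.
# $1 = directory holding flowchart.dot (defaults to ./flowinfo)
render_flowchart() {
  dir=${1:-flowinfo}
  dot -Tpng "$dir/flowchart.dot" -o "$dir/flowchart.png"
}
```

Keeping the dot file as the configured output means the pipeline itself does not depend on Graphviz being available where it runs.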

Configuration profiles

Setting up software environment(s)

  • Global software env for the workflow
  • Separate software envs for individual processes or subsets
  • A bit of both
  • Our example workflow:
    • global Conda env -> Docker -> Singularity

Software environment (Conda)

  • A Conda environment definition can be
    • used directly
    • used to build a container
    • slow…
name: tutorial
channels:
 - bioconda
 - conda-forge
 - defaults
dependencies:
 - fastqc=0.11.8
 - multiqc=1.7
 - trimmomatic=0.36
 - pigz=2.3.4
 - bwa=0.7.17
 - samtools=1.9
 - htslib=1.9
 - unzip=6.0
 - tabix=0.2.6
 - gnu-wget=1.18
profiles {
  conda {
    process {
      conda = "$baseDir/conf/conda.yaml"
    }
  }
}

Software environment (Docker)

  • Container image (automated build) on Docker Hub
  • Need not be conda-based
  • Dockerfile
FROM rsuchecki/miniconda3:4.5.12

ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

LABEL maintainer="Rad Suchecki <rad.suchecki@csiro.au>"
SHELL ["/bin/bash", "-c"]

COPY conf/conda.yaml /
RUN conda env create -f /conda.yaml && conda clean -a
ENV PATH /opt/conda/envs/tutorial/bin:$PATH
  docker {
    process.container = 'rsuchecki/nextflow-embl-abr-webinar'
    docker {
      enabled = true
      fixOwnership = true
    }
  }
  • NF pulls the container image from Docker Hub when our pipeline is run with -profile docker (or -profile awsbatch)

Software environment (Singularity)

  • Singularity can pull from Docker Hub and convert to its format
  • Dedicated build on Singularity Hub
    • ensures the same image is used (reproducibility!)
    • Need not be Docker based
    • Build automation less flexible than on Docker Hub
    • Singularity recipe
Bootstrap:docker
From:rsuchecki/nextflow-embl-abr-webinar:latest
  singularity {
    process.container = 'shub://csiro-crop-informatics/nextflow-embl-abr-webinar' //Singularity hub
    singularity {
      enabled = true
      autoMounts = true
    }
  }
  • NF pulls the container image from Singularity Hub when our pipeline is run with -profile singularity

Pipeline outputs:

  • The publishDir directive
    • define end products of a pipeline
    • make them easily accessible
process {
  withName: multiqc_raw {
    publishDir {
      path = "${params.outdir}/qc_raw"
      mode = "${params.publishmode}"
    }
  }
  withName: multiqc_trimmed {
    publishDir {
      path = "${params.outdir}/qc_trimmed"
      mode = "${params.publishmode}"
    }
  }
  withName: bwa_mem {
    publishDir {
      path = "${params.outdir}/bwa"
      mode = "${params.publishmode}"
    }
  }
  //Currently not applied, add:
  //label 'stats'
  //at the top of a process definition to store declared outputs as follows
  withLabel: stats {
    publishDir {
      path = "${params.outdir}/stats"
      mode = "${params.publishmode}"
    }
  }
}

Pipeline outputs

results
├── bwa
│   └── ACBarrie.bam
├── downloaded
│   └── iwgsc_refseqv1.0_chr4A.fsa.zip
├── downloaded_reads
│   ├── ACBarrie_R1.fastq.gz
│   └── ACBarrie_R2.fastq.gz
├── flowinfo
│   ├── report.html
│   ├── timeline.html
│   ├── trace.txt
│   └── trace.txt.1
├── qc_raw
│   ├── multiqc_data
│   └── multiqc_report.html
└── qc_trimmed
    ├── multiqc_data
    └── multiqc_report.html

8 directories, 10 files

AWS Batch execution

Recording of the AWS Batch run (asciicast): https://asciinema.org/a/233421

Workflow introspection

NF Resources

Acknowledgments
